Simulation Studies for Evaluating ForestSearch Performance

Operating Characteristics and Power Analysis

Author

ForestSearch Package

Published

January 23, 2026

1 Introduction

This vignette demonstrates how to conduct simulation studies to evaluate the performance of ForestSearch for identifying subgroups with differential treatment effects. The simulation framework allows you to:

  • Generate synthetic clinical trial data with known treatment effect heterogeneity
  • Evaluate subgroup identification rates (power)
  • Assess classification accuracy (sensitivity, specificity, PPV, NPV)
  • Compare different analysis methods (ForestSearch, GRF)
  • Estimate Type I error under null hypothesis
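To make the classification metrics concrete, here is a toy sketch (illustrative vectors only, not package code) of how sensitivity, specificity, PPV, and NPV compare true and estimated subgroup membership:

```r
# Toy flags: 1 = patient in harm subgroup H (true) or in estimated Hhat
true_H <- c(1, 1, 1, 0, 0, 0, 0, 0)
est_H  <- c(1, 1, 0, 1, 0, 0, 0, 0)

sens <- sum(est_H == 1 & true_H == 1) / sum(true_H == 1)  # P(in Hhat | in H)
spec <- sum(est_H == 0 & true_H == 0) / sum(true_H == 0)  # P(out of Hhat | out of H)
ppv  <- sum(true_H == 1 & est_H == 1) / sum(est_H == 1)   # P(in H | in Hhat)
npv  <- sum(true_H == 0 & est_H == 0) / sum(est_H == 0)   # P(out of H | out of Hhat)
round(c(sensitivity = sens, specificity = spec, PPV = ppv, NPV = npv), 2)
```

In the simulation summaries below, these metrics are averaged over replications in which a subgroup was found.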

1.1 Simulation Framework Overview

The simulation workflow consists of four main steps:

flowchart LR
    A[Create DGM] --> B[Simulate Trials]
    B --> C[Run Analyses]
    C --> D[Summarize Results]

  1. Create DGM: Define a data generating mechanism with specified treatment effects
  2. Simulate Trials: Generate multiple simulated datasets
  3. Run Analyses: Apply ForestSearch (and optionally GRF) to each dataset
  4. Summarize Results: Aggregate operating characteristics across simulations

2 Setup

# Core packages
library(forestsearch)
library(weightedsurv)

library(data.table)
library(survival)
library(ggplot2)
library(gt)

# Parallel processing
library(foreach)
library(doFuture)
library(future)

# Source simulation functions (if not yet in package)
# source("sim_aft_gbsg_refactored.R")
# source("oc_analyses_refactored.R")

3 Creating a Data Generating Mechanism

The simulation framework uses the German Breast Cancer Study Group (GBSG) dataset as a template for realistic covariate distributions and censoring patterns.

3.1 Understanding the DGM

The create_gbsg_dgm() function creates a data generating mechanism (DGM) based on an Accelerated Failure Time (AFT) model with Weibull distribution. Key features:

  • Covariates: Age, estrogen receptor, menopausal status, progesterone receptor, nodes
  • Treatment effect heterogeneity: Specified via interaction terms
  • Subgroup definition: H = {low estrogen receptor AND premenopausal}
  • Censoring: Weibull or uniform censoring model
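As a concrete illustration of the subgroup definition (the DGM summary printed below expresses it as z1 == 1 & z3 == 1), the harm flag is the intersection of the two indicators. The variable coding here is assumed for illustration:

```r
# Assumed coding for illustration: z1 = 1 if low estrogen receptor,
# z3 = 1 if premenopausal; H is the intersection of the two indicators.
df <- data.frame(z1 = c(1, 1, 0, 0), z3 = c(1, 0, 1, 0))
df$flag.harm <- as.integer(df$z1 == 1 & df$z3 == 1)
df$flag.harm  # only the first row is in H: 1 0 0 0
```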

3.2 Alternative Hypothesis (Heterogeneous Treatment Effect)

Under the alternative hypothesis, we create a DGM where the treatment effect varies across patient subgroups:

# Create DGM with heterogeneous treatment effect
# HR in harm subgroup (H) will be > 1 (treatment harmful)
# HR in complement (H^c) will be < 1 (treatment beneficial)

dgm_alt <- create_gbsg_dgm(
  model = "alt",
  k_treat = 1.0,
  k_inter = 2.0,      # Interaction effect multiplier
  k_z3 = 1.0,
  z1_quantile = 0.25, # ER threshold at 25th percentile
  n_super = 5000,
  cens_type = "weibull",
  seed = 8316951,
  verbose = TRUE
)

# Examine the DGM
print(dgm_alt)
GBSG-Based AFT Data Generating Mechanism
=========================================

Model type: alt 
Super-population size: 5000 

Hazard Ratios:
  Overall (causal): 0.709 
  Harm subgroup (H): 3.074 
  Complement (H^c): 0.635 

Subgroup definition: z1 == 1 & z3 == 1 (low ER & premenopausal) 
Analysis variables: v1, v2, v3, v4, v5, v6, v7 

3.3 Calibrating for a Target Hazard Ratio

Often, you want to specify the exact hazard ratio in the harm subgroup. Use calibrate_k_inter() to find the interaction parameter that achieves this:

# Find k_inter for HR = 2.0 in harm subgroup
k_inter_target <- calibrate_k_inter(
  target_hr_harm = 2.0,
  model = "alt",
  k_treat = 1.5,
  cens_type = "weibull",
  verbose = TRUE
)

# Create DGM with calibrated k_inter
dgm_calibrated <- create_gbsg_dgm(
  model = "alt",
  k_treat = 1.5,
  k_inter = k_inter_target,
  verbose = TRUE
)

cat("\nVerification:\n")

Verification:
cat("Target HR(H):", 1.5, "\n")
Target HR(H): 1.5 
cat("Achieved HR(H):", round(dgm_calibrated$hr_H_true, 3), "\n")
Achieved HR(H): 2 
cat("HR(H^c):", round(dgm_calibrated$hr_Hc_true, 3), "\n")
HR(H^c): 0.517 
cat("Overall HR:", round(dgm_calibrated$hr_causal, 3), "\n")
Overall HR: 0.58 

3.4 Null Hypothesis (Uniform Treatment Effect)

For Type I error evaluation, create a DGM with uniform treatment effect:

# Create null DGM (no treatment effect heterogeneity)
dgm_null <- create_gbsg_dgm(
  model = "null",
  k_treat = 1.5,
  verbose = TRUE
)

cat("\nNull hypothesis HRs:\n")

Null hypothesis HRs:
cat("Overall HR:", round(dgm_null$hr_causal, 3), "\n")
Overall HR: 0.593 
cat("HR(H^c):", round(dgm_null$hr_Hc_true, 3), "\n")
HR(H^c): 0.593 

4 Simulating Trial Data

4.1 Single Trial Simulation

Use simulate_from_gbsg_dgm() to generate a single simulated trial:

# Simulate a single trial
sim_data <- simulate_from_gbsg_dgm(
  dgm = dgm_calibrated,
  n = 700,
  rand_ratio = 1,        # 1:1 randomization
  sim_id = 1,
  max_follow = 84,       # 84 months administrative censoring
  muC_adj = log(1.5)     # Censoring-distribution adjustment (log scale)
)

# Examine the data
cat("Simulated trial:\n")
Simulated trial:
cat("  N =", nrow(sim_data), "\n")
  N = 700 
cat("  Events =", sum(sim_data$event.sim), 
    "(", round(100 * mean(sim_data$event.sim), 1), "%)\n")
  Events = 356 ( 50.9 %)
cat("  Harm subgroup size =", sum(sim_data$flag.harm),
    "(", round(100 * mean(sim_data$flag.harm), 1), "%)\n")
  Harm subgroup size = 85 ( 12.1 %)
# Quick survival analysis
fit_itt <- coxph(Surv(y.sim, event.sim) ~ treat, data = sim_data)
cat("  Estimated ITT HR =", round(exp(coef(fit_itt)), 3), "\n")
  Estimated ITT HR = 0.677 

4.2 Examining Covariate Structure

dfcount <- df_counting(
  df = sim_data,
  by.risk = 6,
  tte.name = "y.sim", 
  event.name = "event.sim", 
  treat.name = "treat"
)
plot_weighted_km(
  dfcount,
  conf.int = TRUE, show.logrank = TRUE,
  ymax = 1.05, xmed.fraction = 0.775, ymed.offset = 0.125
)

create_summary_table(
  data = sim_data, treat_var = "treat",
  table_title = "Characteristics by Treatment Arm",
  vars_continuous = c("z1", "z2", "size", "z3", "z4", "z5"),
  vars_categorical = c("flag.harm", "grade3"),
  font_size = 12
)
Characteristics by Treatment Arm

Characteristic     Control (n=350)  Treatment (n=350)  P-value¹  SMD²
z1, Mean (SD)      0.3 (0.4)        0.3 (0.4)          0.931     0.01
z2, Mean (SD)      2.5 (1.1)        2.4 (1.1)          0.345     0.07
size, Mean (SD)    28.9 (12.9)      31.0 (16.2)        0.060     0.14
z3, Mean (SD)      0.4 (0.5)        0.5 (0.5)          0.253     0.09
z4, Mean (SD)      2.5 (1.1)        2.6 (1.1)          0.761     0.02
z5, Mean (SD)      2.5 (1.1)        2.4 (1.1)          0.367     0.07
flag.harm                                              0.165     0.05
  0                314 (89.7%)      301 (86.0%)
  1                36 (10.3%)       49 (14.0%)
grade3                                                 1.000     0.00
  0                268 (76.6%)      269 (76.9%)
  1                82 (23.4%)       81 (23.1%)
¹ P-values: t-test for continuous, chi-square/Fisher's exact for categorical/binary variables
² SMD = Standardized mean difference (Cohen's d for continuous, Cramer's V for categorical)

5 Running Simulation Studies

5.1 Setting Up Parallel Processing

For efficient simulation studies, use parallel processing:

# Configure parallel backend
n_workers <- min(parallel::detectCores() - 1, 120)

plan(multisession, workers = n_workers)
registerDoFuture()

cat("Using", n_workers, "parallel workers\n")
Using 120 parallel workers

5.2 Define Simulation Parameters

# Simulation settings
sim_config_alt <- list(
  n_sims = 2000,         # Number of simulations (reduce for quick exploratory runs)
  n_sample = 700,        # Sample size per trial
  max_follow = 84,       # Maximum follow-up (months)
  seed_base = 8316951,
  muC_adj = log(1.5)
)

sim_config_null <- list(
  n_sims = 2500,         # Number of simulations (reduce for quick exploratory runs)
  n_sample = 700,        # Sample size per trial
  max_follow = 84,       # Maximum follow-up (months)
  seed_base = 8316951,
  muC_adj = log(1.5)
)


# ForestSearch parameters
fs_params <- list(
  outcome.name = "y.sim",
  event.name = "event.sim",
  treat.name = "treat",
  id.name = "id",
  hr.threshold = 1.25,
  hr.consistency = 1.0,
  pconsistency.threshold = 0.90,
  use_grf = TRUE,
  use_lasso = FALSE,
  fs.splits = 400,
  n.min = 60,
  maxk = 2
)

# Confounders for analysis
confounders_base <- c("z1", "z2", "z3", "z4", "z5", "size", "grade3")

5.3 Running the Simulation Loop

# Track timing
t_start <- Sys.time()

sim_config <- sim_config_alt

# Run simulations in parallel
results_alt <- foreach(
  sim = 1:sim_config$n_sims,
  .combine = rbind,
  .errorhandling = "remove",
  .options.future = list(
    packages = c("forestsearch", "survival", "data.table"),
    seed = TRUE
  )
) %dofuture% {
  # Run a single simulation analysis
  run_simulation_analysis(
    sim_id = sim,
    dgm = dgm_calibrated,
    n_sample = sim_config$n_sample,
    max_follow = sim_config$max_follow,
    muC_adj = sim_config$muC_adj,
    confounders_base = confounders_base,
    n_add_noise = 0,
    run_fs = TRUE,
    run_fs_grf = FALSE,
    run_grf = TRUE,      # Set TRUE if grf_subg_harm_survival available
    fs_params = fs_params,
    n_sims_total = sim_config$n_sims,
    seed_base = sim_config$seed_base,
    verbose = TRUE,
    verbose_n = 1
  )
}

=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30 
Maximum total splits: 400 
Batch size: 20 
================================================

GRF stage for cut selection with dmin, tau = 12 0.6 
tau, maxdepth = 48.68693 2 
   leaf.node control.mean control.size control.se depth
1          2         2.02       187.00       1.97     1
2          3        -5.93       513.00       1.23     1
21         5        -7.11       403.00       1.38     2
3          6        -3.22       189.00       1.78     2
4          7         8.08        95.00       3.22     2

Selected subgroup:
  leaf.node control.mean control.size control.se depth
4         7         8.08        95.00       3.22     2

GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "z1 <= 0"

All splits:
[1] "z2 <= 1" "z3 <= 0" "z1 <= 0"
GRF cuts identified: 3 
  Cuts: z2 <= 1, z3 <= 0, z1 <= 0 
# of continuous/categorical characteristics 4 3 
Continuous characteristics: z2 z4 z5 size 
Categorical characteristics: z1 z3 grade3 
Default cuts included (1st 20) z2 <= mean(z2) z2 <= median(z2) z2 <= qlow(z2) z2 <= qhigh(z2) z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) size <= mean(size) size <= median(size) size <= qlow(size) size <= qhigh(size) 
Categorical: z1 z3 grade3 
Factors per GRF: z2 <= 1 z3 <= 0 z1 <= 0 
Initial GRF cuts included z2 <= 1 z3 <= 0 z1 <= 0 

===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 18 cut expressions once and caching...
Cut evaluation summary:
  Total cuts:  18 
  Valid cuts:  17 
  Errors:  0 
Dropping variables (cut only has 1 level): z4 <= 4 
Total cuts after dropping invalid: 17
✓ All 17 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====

# of candidate subgroup factors= 17 
 [1] "z2 <= 1"      "z2 <= 2.5"    "z2 <= 2"      "z2 <= 3.2"    "z4 <= 2.5"   
 [6] "z4 <= 1"      "z5 <= 2.4"    "z5 <= 2"      "z5 <= 1"      "z5 <= 3"     
[11] "size <= 29.4" "size <= 25"   "size <= 20"   "size <= 35"   "z1"          
[16] "z3"           "grade3"      
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 595 
Events criteria: control >= 12 , treatment >= 12 
Subgroup search completed in 0.13 minutes
Found 9 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 9 
Random seed set to: 8316951 
Removed 2 near-duplicate subgroups
Original rows: 9 
After removal: 7 
# of unique initial candidates: 7 
# Restricting to top stop_Kgroups = 10 
# of candidates to evaluate: 7 
Algorithm: Two-stage sequential 
  Stage 1 splits: 30 
  Screen threshold: 0.763 
  Max total splits: 400 
  Batch size: 20 
Parallel processing: callr with 6 workers
SG focus= hr 
Subgroup Consistency Minutes= 0.461 
Algorithm used: Two-stage sequential 
Candidates evaluated: 7 
Candidates passed: 2 
Subgroup found (FS) with sg_focus='hr'
Selected subgroup: {z3} & {z1} 
Minutes forestsearch overall = 0.66 
Consistency algorithm used: twostage 
tau, maxdepth = 48.68693 2 
   leaf.node control.mean control.size control.se depth
1          2         2.02       187.00       1.97     1
2          3        -5.93       513.00       1.23     1
21         5        -7.11       403.00       1.38     2
3          6        -3.22       189.00       1.78     2
4          7         8.08        95.00       3.22     2

Selected subgroup:
  leaf.node control.mean control.size control.se depth
4         7         8.08        95.00       3.22     2

GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "z1 <= 0"

All splits:
[1] "z2 <= 1" "z3 <= 0" "z1 <= 0"
# Calculate runtime
t_end <- Sys.time()
runtime_mins <- as.numeric(difftime(t_end, t_start, units = "mins"))

cat("\n=== Simulation Complete ===\n")

=== Simulation Complete ===
cat("Total simulations:", sim_config$n_sims, "\n")
Total simulations: 2000 
cat("Successful:", nrow(results_alt) / length(unique(results_alt$analysis)), "\n")
Successful: 2000 
cat("Runtime:", round(runtime_mins, 2), "minutes\n")
Runtime: 13.75 minutes
cat("Per simulation:", round(runtime_mins / sim_config$n_sims * 60, 1), "seconds\n")
Per simulation: 0.4 seconds

5.4 Running Null Hypothesis Simulations (Type I Error)

sim_config <- sim_config_null

# Track timing
t_start <- Sys.time()

# Run null hypothesis simulations
results_null <- foreach(
  sim = 1:sim_config$n_sims,
  .combine = rbind,
  .errorhandling = "remove",
  .options.future = list(
    packages = c("forestsearch", "survival", "data.table"),
    seed = TRUE
  )
) %dofuture% {
  
  run_simulation_analysis(
    sim_id = sim,
    dgm = dgm_null,
    n_sample = sim_config$n_sample,
    max_follow = sim_config$max_follow,
    muC_adj = sim_config$muC_adj,
    confounders_base = confounders_base,
    run_fs = TRUE,
    run_fs_grf = FALSE,
    run_grf = TRUE,
    fs_params = fs_params,
    verbose = TRUE,
    verbose_n = 1
  )
}

=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30 
Maximum total splits: 400 
Batch size: 20 
================================================

GRF stage for cut selection with dmin, tau = 12 0.6 
tau, maxdepth = 47.90017 2 
   leaf.node control.mean control.size control.se depth
1          2        -6.49       517.00       1.23     1
2          3         0.17       183.00       1.63     1
21         5        -8.12       324.00       1.56     2
3          6         0.96       171.00       1.73     2
4          7        -5.64       179.00       1.92     2
GRF subgroup NOT found
GRF cuts identified: 0 
# of continuous/categorical characteristics 4 3 
Continuous characteristics: z2 z4 z5 size 
Categorical characteristics: z1 z3 grade3 
Default cuts included (1st 20) z2 <= mean(z2) z2 <= median(z2) z2 <= qlow(z2) z2 <= qhigh(z2) z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) size <= mean(size) size <= median(size) size <= qlow(size) size <= qhigh(size) 
Categorical: z1 z3 grade3 

===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 18 cut expressions once and caching...
Cut evaluation summary:
  Total cuts:  18 
  Valid cuts:  17 
  Errors:  0 
Dropping variables (cut only has 1 level): z4 <= 4 
Total cuts after dropping invalid: 17
✓ All 17 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====

# of candidate subgroup factors= 17 
 [1] "z2 <= 2.5"    "z2 <= 2"      "z2 <= 1"      "z2 <= 3.2"    "z4 <= 2.5"   
 [6] "z4 <= 1"      "z5 <= 2.4"    "z5 <= 2"      "z5 <= 1"      "z5 <= 3"     
[11] "size <= 29.4" "size <= 25"   "size <= 20"   "size <= 35"   "z1"          
[16] "z3"           "grade3"      
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 595 
Events criteria: control >= 12 , treatment >= 12 
Subgroup search completed in 0.14 minutes
NO subgroup candidates found
tau, maxdepth = 47.90017 2 
   leaf.node control.mean control.size control.se depth
1          2        -6.49       517.00       1.23     1
2          3         0.17       183.00       1.63     1
21         5        -8.12       324.00       1.56     2
3          6         0.96       171.00       1.73     2
4          7        -5.64       179.00       1.92     2
GRF subgroup NOT found
cat("Null simulations completed:", 
    nrow(results_null) / length(unique(results_null$analysis)), "\n")
Null simulations completed: 2500 
# Calculate runtime
t_end <- Sys.time()
runtime_mins <- as.numeric(difftime(t_end, t_start, units = "mins"))

cat("\n=== Simulation Complete ===\n")

=== Simulation Complete ===
cat("Total simulations:", sim_config$n_sims, "\n")
Total simulations: 2500 
cat("Successful:", nrow(results_null) / length(unique(results_null$analysis)), "\n")
Successful: 2500 
cat("Runtime:", round(runtime_mins, 2), "minutes\n")
Runtime: 6.04 minutes
cat("Per simulation:", round(runtime_mins / sim_config$n_sims * 60, 1), "seconds\n")
Per simulation: 0.1 seconds

6 Analyzing Simulation Results

6.1 Summary Statistics

# Summarize alternative hypothesis results
summary_alt <- summarize_simulation_results(
  results = results_alt,
  digits = 2,
  digits_hr = 3
)

# Summarize null hypothesis results  
summary_null <- summarize_simulation_results(
  results = results_null,
  digits = 2,
  digits_hr = 3
)

# Display summaries
cat("=== Alternative Hypothesis (Power) ===\n")
=== Alternative Hypothesis (Power) ===
print(summary_alt)
                    FS     GRF
any.H            0.910   0.810
sensH            0.960   0.930
sensHc           0.990   0.990
ppH              0.920   0.940
ppHc             0.990   0.990
Avg(#H)         84.000  91.000
minH            61.000  60.000
maxH           156.000 201.000
Avg(#Hc)       623.000 627.000
minHc          544.000 499.000
maxHc          700.000 700.000
hat(H*)          2.284   2.178
hat(hat[H])      2.284   2.178
hat(Hc*)         0.517   0.517
hat(hat[Hc])     0.507   0.500
hat(H*)all       2.284   2.178
hat(Hc*)all      0.517   0.517
hat(ITT)all      0.611   0.611
hat(ITTadj)all     NaN     NaN
cat("\n=== Null Hypothesis (Type I Error) ===\n")

=== Null Hypothesis (Type I Error) ===
print(summary_null)
                    FS     GRF
any.H            0.010   0.020
sensH            0.000   0.000
sensHc           1.000   1.000
ppH                NaN     NaN
ppHc             0.870   0.900
Avg(#H)         93.000  73.000
minH            61.000  60.000
maxH           152.000 126.000
Avg(#Hc)       699.000 698.000
minHc          548.000 574.000
maxHc          700.000 700.000
hat(H*)          1.786   1.474
hat(hat[H])      1.786   1.474
hat(Hc*)         0.593   0.593
hat(hat[Hc])     0.575   0.562
hat(H*)all       1.786   1.474
hat(Hc*)all      0.593   0.593
hat(ITT)all      0.580   0.580
hat(ITTadj)all     NaN     NaN

6.2 Formatted Tables with gt

# Create formatted table for alternative hypothesis

tbl_alt <- format_oc_results(
  results = results_alt,
  digits = 2,
  digits_hr = 3
)

tbl_alt
Operating Characteristics Summary

Method  N Sims  Detection  Sens  Spec  PPV   NPV   HR_H_hat  HR_Hc_hat  HR_H_true  HR_Hc_true  HR_ITT  Size_H_mean  Size_H_min  Size_H_max
FS      2000    0.91       0.92  0.99  0.96  0.99  2.284     0.507      2.284      0.517       0.611   84           61          156
GRF     2000    0.81       0.94  0.99  0.93  0.99  2.178     0.500      2.178      0.517       0.611   91           60          201
# Create formatted table for null hypothesis
tbl_null <- format_oc_results(
  results = results_null,
  digits = 2,
  digits_hr = 3
)

tbl_null
Operating Characteristics Summary

Method  N Sims  Detection  Sens  Spec  PPV   NPV   HR_H_hat  HR_Hc_hat  HR_H_true  HR_Hc_true  HR_ITT  Size_H_mean  Size_H_min  Size_H_max
FS      2500    0.01       -     0.87  0.00  1.00  1.786     0.575      1.786      0.593       0.580   93           61          152
GRF     2500    0.02       -     0.90  0.00  1.00  1.474     0.562      1.474      0.593       0.580   73           60          126

6.3 Key Operating Characteristics

# Extract key metrics for comparison
methods <- unique(results_alt$analysis)

metrics_comparison <- data.frame(
  Method = methods,
  Power = sapply(methods, function(m) {
    mean(results_alt[results_alt$analysis == m, ]$any.H)
  }),
  TypeI = sapply(methods, function(m) {
    mean(results_null[results_null$analysis == m, ]$any.H)
  }),
  Sensitivity = sapply(methods, function(m) {
    mean(results_alt[results_alt$analysis == m, ]$sensitivity, na.rm = TRUE)
  }),
  PPV = sapply(methods, function(m) {
    mean(results_alt[results_alt$analysis == m, ]$ppv, na.rm = TRUE)
  })
)

metrics_comparison |>
  gt() |>
  tab_header(
    title = "Method Comparison",
    subtitle = sprintf("N = %d, n = %d per trial", 
                       sim_config$n_sims, sim_config$n_sample)
  ) |>
  fmt_percent(columns = c(Power, TypeI, Sensitivity, PPV), decimals = 1) |>
  cols_label(
    Method = "Analysis Method",
    Power = "Power",
    TypeI = "Type I Error",
    Sensitivity = "Sensitivity",
    PPV = "PPV"
  )
Method Comparison
N = 2500, n = 700 per trial

Analysis Method  Power  Type I Error  Sensitivity  PPV
FS               91.1%  1.2%          91.6%        96.2%
GRF              80.5%  2.1%          94.2%        92.9%

7 Visualizing Results

7.1 Subgroup Detection Rates

# Detection rates by method
detection_data <- results_alt[, .(
  detection_rate = mean(any.H),
  se = sqrt(mean(any.H) * (1 - mean(any.H)) / .N)
), by = analysis]

ggplot(detection_data, aes(x = analysis, y = detection_rate, fill = analysis)) +
  geom_col(width = 0.6) +
  geom_errorbar(
    aes(ymin = detection_rate - 1.96 * se, 
        ymax = detection_rate + 1.96 * se),
    width = 0.2
  ) +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "red", alpha = 0.7) +
  annotate("text", x = 0.6, y = 0.82, label = "80% Power", hjust = 0, size = 3) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(
    x = "Analysis Method",
    y = "Subgroup Detection Rate",
    title = "Power to Detect Harm Subgroup",
    subtitle = sprintf("HR(H) = %.2f, n = %d, %d simulations",
                       dgm_calibrated$hr_H_true, 
                       sim_config$n_sample,
                       sim_config$n_sims)
  ) +
  theme_minimal() +
  theme(legend.position = "none")

7.2 Hazard Ratio Distributions

# Prepare data for plotting
hr_long <- melt(
  results_alt,
  id.vars = c("sim", "analysis", "any.H"),
  measure.vars = c("hr.itt", "hr.H.hat", "hr.Hc.hat"),
  variable.name = "estimand",
  value.name = "hr"
)

hr_long$estimand <- factor(hr_long$estimand, 
  levels = c("hr.itt", "hr.H.hat", "hr.Hc.hat"),
  labels = c("ITT", "Harm Subgroup (H)", "Complement (H^c)")
)

# True values for reference
true_hrs <- data.frame(
  estimand = c("ITT", "Harm Subgroup (H)", "Complement (H^c)"),
  true_hr = c(dgm_calibrated$hr_causal, 
              dgm_calibrated$hr_H_true, 
              dgm_calibrated$hr_Hc_true)
)

ggplot(hr_long[!is.na(hr_long$hr), ], 
       aes(x = estimand, y = hr, fill = analysis)) +
  geom_violin(position = position_dodge(0.8), alpha = 0.7) +
  geom_boxplot(position = position_dodge(0.8), width = 0.15, 
               outlier.size = 0.5, alpha = 0.9) +
  geom_hline(data = true_hrs, aes(yintercept = true_hr), 
             linetype = "dashed", color = "red") +
  geom_hline(yintercept = 1, linetype = "solid", color = "gray50", alpha = 0.5) +
  scale_y_log10() +
  labs(
    x = "Estimand",
    y = "Hazard Ratio (log scale)",
    fill = "Method",
    title = "Distribution of Hazard Ratio Estimates",
    subtitle = "Dashed red lines indicate true values"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

7.3 Classification Performance

# Classification metrics
class_metrics <- results_alt[any.H == 1, .(
  Sensitivity = mean(sensitivity, na.rm = TRUE),
  Specificity = mean(specificity, na.rm = TRUE),
  PPV = mean(ppv, na.rm = TRUE),
  NPV = mean(npv, na.rm = TRUE)
), by = analysis]

class_long <- melt(class_metrics, id.vars = "analysis", 
                   variable.name = "metric", value.name = "value")

ggplot(class_long, aes(x = metric, y = value, fill = analysis)) +
  geom_col(position = position_dodge(0.8), width = 0.7) +
  geom_hline(yintercept = 0.8, linetype = "dashed", alpha = 0.5) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(
    x = "Classification Metric",
    y = "Value",
    fill = "Method",
    title = "Classification Performance (When Subgroup Found)",
    subtitle = "Sensitivity = P(in Ĥ | in H), PPV = P(in H | in Ĥ)"
  ) +
  theme_minimal()

7.4 Subgroup Size Distribution

# Size distribution when subgroup found
size_data <- results_alt[any.H == 1]

ggplot(size_data, aes(x = size.H, fill = analysis)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  geom_vline(xintercept = sim_config$n_sample * mean(dgm_calibrated$df_super_rand$flag.harm),
             linetype = "dashed", color = "red") +
  facet_wrap(~ analysis, ncol = 1) +
  labs(
    x = "Estimated Subgroup Size",
    y = "Count",
    title = "Distribution of Estimated Subgroup Sizes",
    subtitle = "Dashed line = expected true subgroup size"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

8 Theoretical Subgroup Detection Rate Approximation

The function compute_detection_probability() provides an analytical approximation based on asymptotic normal theory:
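For intuition, the flavor of such an approximation can be sketched with a simple normal model for the estimated log hazard ratio. This is a rough stand-in under stated assumptions, not the package's actual calculation, which also accounts for the consistency criterion:

```r
# Rough sketch (assumption, not the package method): with d events in a 1:1
# randomized subgroup of size n_sg, the log-HR estimate is approximately
# N(log(theta), 4 / d), so the probability the estimate exceeds
# log(hr_threshold) is a single normal tail probability.
approx_detect <- function(theta, n_sg, prop_cens, hr_threshold) {
  d  <- n_sg * (1 - prop_cens)  # expected number of events
  se <- 2 / sqrt(d)             # approximate SE of the log-HR
  1 - pnorm((log(hr_threshold) - log(theta)) / se)
}

approx_detect(theta = 2.0, n_sg = 89, prop_cens = 0.484, hr_threshold = 1.25)
```

Because it ignores the consistency (split-based) criterion, this one-sided tail probability will generally differ somewhat from compute_detection_probability().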

#| label: theoretical-power
#| fig-width: 8
#| fig-height: 6


# =============================================================================
# Theoretical Detection Probability Analysis
# =============================================================================

# Calculate expected subgroup characteristics
n_sg_expected <- sim_config$n_sample * mean(dgm_calibrated$df_super_rand$flag.harm)
prop_cens <- mean(results_alt$p.cens)  # Censoring proportion

cat("=== Subgroup Characteristics ===\n")
=== Subgroup Characteristics ===
cat("Expected subgroup size (n_sg):", round(n_sg_expected), "\n")
Expected subgroup size (n_sg): 89 
cat("Censoring proportion:", round(prop_cens, 3), "\n")
Censoring proportion: 0.484 
cat("True HR in H:", round(dgm_calibrated$hr_H_true, 3), "\n")
True HR in H: 2 
cat("HR threshold:", fs_params$hr.threshold, "\n")
HR threshold: 1.25 
# -----------------------------------------------------------------------------
# Single-Point Detection Probability
# -----------------------------------------------------------------------------

# The true HR in H is dgm_calibrated$hr_H_true; alternatively, evaluate the
# approximation at the plim of the observed estimate:
# plim_hr_hatH <- c(summary_alt[c("hat(hat[H])"), 1])

dgm_calibrated$hr_H_true
   treat 
2.000339 
# Compute detection probability at the true HR
prob_detect <- compute_detection_probability(
  theta = dgm_calibrated$hr_H_true,
  n_sg = round(n_sg_expected),
  prop_cens = prop_cens,
  hr_threshold = fs_params$hr.threshold,
  hr_consistency = fs_params$hr.consistency,
  method = "cubature"
)

# Compare theoretical to empirical (alternative)
cat("\n=== Detection Probability Comparison ===\n")

=== Detection Probability Comparison ===
cat("Theoretical FS (asymptotic):", round(prob_detect, 3), "\n")
Theoretical FS (asymptotic): 0.89 
cat("Empirical FS:", round(mean(results_alt[analysis == "FS"]$any.H), 3), "\n")
Empirical FS: 0.911 
cat("Empirical FSlg:", round(mean(results_alt[analysis == "FSlg"]$any.H), 3), "\n")
Empirical FSlg: NaN 
if ("GRF" %in% results_alt$analysis) {
  cat("Empirical GRF:", round(mean(results_alt[analysis == "GRF"]$any.H), 3), "\n")
}
Empirical GRF: 0.805 
# Null-hypothesis check

# plim_hr_itt <- c(summary_alt[c("hat(ITT)all"), 1])

# Compute detection probability at the overall (null) HR
prob_detect_null <- compute_detection_probability(
  theta = dgm_null$hr_causal,
  n_sg = round(sim_config$n_sample),
  prop_cens = prop_cens,
  hr_threshold = fs_params$hr.threshold,
  hr_consistency = fs_params$hr.consistency,
  method = "cubature"
)


# Compare theoretical to empirical (null)
cat("\n=== Detection Probability Comparison ===\n")

=== Detection Probability Comparison ===
cat("Theoretical FS (asymptotic):", round(prob_detect_null, 6), "\n")
Theoretical FS (asymptotic): 0 
cat("Empirical FS:", round(mean(results_null[analysis == "FS"]$any.H), 6), "\n")
Empirical FS: 0.0124 
cat("Empirical FSlg:", round(mean(results_null[analysis == "FSlg"]$any.H), 6), "\n")
Empirical FSlg: NaN 
if ("GRF" %in% results_null$analysis) {
  cat("Empirical GRF:", round(mean(results_null[analysis == "GRF"]$any.H), 6), "\n")
}
Empirical GRF: 0.0208 
# -----------------------------------------------------------------------------
# Generate Full Detection Curve
# -----------------------------------------------------------------------------

# Generate detection probability curve across HR values
detection_curve <- generate_detection_curve(
  theta_range = c(0.5, 3.0),
  n_points = 50,
  n_sg = round(n_sg_expected),
  prop_cens = prop_cens,
  hr_threshold = fs_params$hr.threshold,
  hr_consistency = fs_params$hr.consistency,
  include_reference = TRUE,
  verbose = FALSE
)

# -----------------------------------------------------------------------------
# Visualization
# -----------------------------------------------------------------------------

# Plot detection curve with empirical overlay
plot_detection_curve(
  detection_curve,
  add_reference_lines = TRUE,
  add_threshold_line = TRUE,
  title = sprintf(
    "Detection Probability Curve (n=%d, cens=%.0f%%, threshold=%.2f)",
    round(n_sg_expected), 100 * prop_cens, fs_params$hr.threshold
  )
)

# Add empirical detection rates for each method actually present in the results
empirical_rates <- sapply(unique(results_alt$analysis), function(m) {
  mean(results_alt[analysis == m]$any.H)
})

# Mark the true HR and empirical detection rates
points(
  x = rep(dgm_calibrated$hr_H_true, length(empirical_rates)),
  y = empirical_rates,
  pch = c(16, 17, 18)[1:length(empirical_rates)],
  col = c("blue", "darkgreen", "purple")[1:length(empirical_rates)],
  cex = 1.5
)

# Add vertical line at true HR
abline(v = dgm_calibrated$hr_H_true, lty = 2, col = "blue", lwd = 1)

# Legend for empirical points
legend(
  "topleft",
  legend = c(
    sprintf("H true = %.2f", dgm_calibrated$hr_H_true),
    paste(names(empirical_rates), "=", round(empirical_rates, 3))
  ),
  pch = c(NA, 16, 17, 18)[1:(length(empirical_rates) + 1)],
  lty = c(2, rep(NA, length(empirical_rates))),
  col = c("blue", "blue", "darkgreen", "purple")[1:(length(empirical_rates) + 1)],
  cex = 0.8,
  bty = "n"
)

# -----------------------------------------------------------------------------
# Compare Across Sample Sizes (Optional)
# -----------------------------------------------------------------------------

# Show how detection probability changes with sample size
cat("\n=== Detection Probability by Sample Size ===\n")

=== Detection Probability by Sample Size ===
cat(sprintf("At true HR = %.2f:\n", dgm_calibrated$hr_H_true))
At true HR = 2.00:
for (n_mult in c(0.5, 1.0, 1.5, 2.0)) {
  n_test <- round(n_sg_expected * n_mult)
  prob_test <- compute_detection_probability(
    theta = dgm_calibrated$hr_H_true,
    n_sg = n_test,
    prop_cens = prop_cens,
    hr_threshold = fs_params$hr.threshold,
    hr_consistency = fs_params$hr.consistency
  )
  cat(sprintf("  n_sg = %3d (%.1fx): P(detect) = %.3f\n", 
              n_test, n_mult, prob_test))
}
  n_sg =  44 (0.5x): P(detect) = 0.753
  n_sg =  89 (1.0x): P(detect) = 0.890
  n_sg = 133 (1.5x): P(detect) = 0.948
  n_sg = 178 (2.0x): P(detect) = 0.975

9 Advanced Topics

9.1 Adding Noise Variables

Test ForestSearch robustness by including irrelevant noise variables:

# Run simulations with noise variables
results_noise <- foreach(
  sim = 1:sim_config$n_sims,
  .combine = rbind,
  .errorhandling = "remove",
  .options.future = list(
    packages = c("forestsearch", "survival", "data.table"),
    seed = TRUE
  )
) %dofuture% {
  
  run_simulation_analysis(
    sim_id = sim,
    dgm = dgm_calibrated,
    n_sample = sim_config$n_sample,
    confounders_base = confounders_base,
    n_add_noise = 10,  # Add 10 noise variables
    run_fs = TRUE,
    fs_params = fs_params,
    verbose = FALSE
  )
}

# Compare detection rates
cat("Without noise:", round(mean(results_alt$any.H), 3), "\n")
cat("With 10 noise vars:", round(mean(results_noise$any.H), 3), "\n")

9.2 Sensitivity Analysis: Varying Parameters

# Test different HR thresholds
thresholds <- c(1.10, 1.25, 1.50)
results_by_thresh <- list()

for (thresh in thresholds) {
  results_by_thresh[[as.character(thresh)]] <- foreach(
    sim = 1:100,
    .combine = rbind,
    .options.future = list(
      packages = c("forestsearch", "survival", "data.table"),
      seed = TRUE
    )
  ) %dofuture% {
    
    run_simulation_analysis(
      sim_id = sim,
      dgm = dgm_calibrated,
      n_sample = sim_config$n_sample,
      confounders_base = confounders_base,
      run_fs = TRUE,
      fs_params = modifyList(fs_params, list(hr.threshold = thresh)),
      verbose = FALSE
    )
  }
  results_by_thresh[[as.character(thresh)]]$threshold <- thresh
}

# Combine and summarize
combined <- rbindlist(results_by_thresh)
combined[, .(power = mean(any.H), ppv = mean(ppv, na.rm = TRUE)), 
         by = .(threshold, analysis)]

10 Saving Results

# Save simulation results for later use
save_simulation_results(
  results = results_alt,
  dgm = dgm_calibrated,
  summary_table = summary_alt,
  runtime_hours = runtime_mins / 60,
  output_file = "forestsearch_simulation_alt.Rdata",
  power_approx = power_approx
)

save_simulation_results(
  results = results_null,
  dgm = dgm_null,
  summary_table = summary_null,
  runtime_hours = runtime_mins / 60,
  output_file = "forestsearch_simulation_null.Rdata"
)

11 Complete Example Script

Here’s a minimal self-contained script for running a simulation study:

# ===========================================================================
# Complete ForestSearch Simulation Study - Minimal Example
# ===========================================================================

library(forestsearch)
library(data.table)
library(survival)
library(foreach)
library(doFuture)

# --- Configuration ---
N_SIMS <- 500
N_SAMPLE <- 500
TARGET_HR_HARM <- 1.5

# --- Setup parallel processing ---
plan(multisession, workers = 4)
registerDoFuture()

# --- Create DGM ---
k_inter <- calibrate_k_inter(target_hr_harm = TARGET_HR_HARM, verbose = TRUE)
dgm <- create_gbsg_dgm(model = "alt", k_inter = k_inter, verbose = TRUE)

# --- Run simulations ---
confounders <- c("z1", "z2", "z3", "z4", "z5", "size", "grade3")

results <- foreach(
  sim = 1:N_SIMS, 
  .combine = rbind,
  .options.future = list(
    packages = c("forestsearch", "survival", "data.table"),
    seed = TRUE
  )
) %dofuture% {
  run_simulation_analysis(
    sim_id = sim,
    dgm = dgm,
    n_sample = N_SAMPLE,
    max_follow = 60,
    confounders_base = confounders,
    run_fs = TRUE,
    run_fs_grf = TRUE,
    run_grf = FALSE,
    fs_params = list(hr.threshold = 1.25, fs.splits = 300, maxk = 2)
  )
}

# --- Summarize ---
summary_table <- summarize_simulation_results(results)
print(summary_table)

# --- Display formatted table ---
format_oc_results(summary_table, n_sims = N_SIMS, model_type = "alt")

12 Summary

This vignette demonstrated the complete workflow for evaluating ForestSearch performance through simulation:

Step          Function                        Purpose
1. Create DGM create_gbsg_dgm()               Define data generating mechanism
2. Calibrate  calibrate_k_inter()             Achieve target subgroup HR
3. Simulate   simulate_from_gbsg_dgm()        Generate trial data
4. Analyze    run_simulation_analysis()       Run ForestSearch/GRF
5. Summarize  summarize_simulation_results()  Aggregate metrics
6. Display    format_oc_results()             Create gt tables

Key metrics to report:

  • Power (H1) / Type I Error (H0): Subgroup detection rate
  • Sensitivity: P(identified | true harm subgroup)
  • Specificity: P(not identified | true complement)
  • PPV: P(true harm | identified)
  • NPV: P(true complement | not identified)
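These classification metrics follow directly from per-subject indicators of true and identified subgroup membership. A minimal base-R sketch, using hypothetical truth/identified vectors rather than actual package output:

```r
# Hypothetical indicators for one simulated trial (illustrative only):
# 'truth' marks membership in the true harm subgroup H;
# 'identified' marks membership in the subgroup found by the analysis.
truth      <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
identified <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)

sensitivity <- mean(identified[truth])    # P(identified | true harm subgroup)
specificity <- mean(!identified[!truth])  # P(not identified | true complement)
ppv         <- mean(truth[identified])    # P(true harm | identified)
npv         <- mean(!truth[!identified])  # P(true complement | not identified)

round(c(sens = sensitivity, spec = specificity, ppv = ppv, npv = npv), 3)
```

In the simulation framework these quantities are aggregated across replicates by `summarize_simulation_results()`; the sketch above only shows how each metric is defined at the trial level.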

13 Session Info

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doFuture_1.1.2          future_1.67.0           foreach_1.5.2          
[4] gt_1.1.0                ggplot2_4.0.1           survival_3.8-6         
[7] data.table_1.17.8       weightedsurv_0.1.0      forestsearch_0.0.0.9000

loaded via a namespace (and not attached):
 [1] sass_0.4.10          generics_0.1.4       xml2_1.4.0          
 [4] shape_1.4.6.1        stringi_1.8.7        lattice_0.22-7      
 [7] cubature_2.1.4-1     listenv_0.9.1        digest_0.6.37       
[10] magrittr_2.0.4       grf_2.5.0            evaluate_1.0.5      
[13] grid_4.5.2           RColorBrewer_1.1-3   iterators_1.0.14    
[16] policytree_1.2.3     fastmap_1.2.0        jsonlite_2.0.0      
[19] Matrix_1.7-4         glmnet_4.1-10        scales_1.4.0        
[22] codetools_0.2-20     cli_3.6.5            rlang_1.1.6         
[25] parallelly_1.45.1    future.apply_1.20.0  splines_4.5.2       
[28] withr_3.0.2          yaml_2.3.10          tools_4.5.2         
[31] parallel_4.5.2       dplyr_1.1.4          globals_0.18.0      
[34] vctrs_0.6.5          R6_2.6.1             lifecycle_1.0.4     
[37] stringr_1.5.2        randomForest_4.7-1.2 fs_1.6.6            
[40] htmlwidgets_1.6.4    pkgconfig_2.0.3      progressr_0.16.0    
[43] pillar_1.11.1        gtable_0.3.6         glue_1.8.0          
[46] Rcpp_1.1.0           xfun_0.54            tibble_3.3.0        
[49] tidyselect_1.2.1     rstudioapi_0.17.1    knitr_1.50          
[52] farver_2.1.2         htmltools_0.5.8.1    labeling_0.4.3      
[55] rmarkdown_2.30       compiler_4.5.2       S7_0.2.1            

14 References

  1. León LF, Marceau-West CT, He W, et al. (2024). “Identifying Patient Subgroups with Differential Treatment Effects: A Forest Search Approach.” Statistics in Medicine.

  2. Athey S, Imbens GW. (2016). “Recursive partitioning for heterogeneous causal effects.” PNAS, 113(27):7353-7360.

  3. Wager S, Athey S. (2018). “Estimation and inference of heterogeneous treatment effects using random forests.” JASA, 113(523):1228-1242.